{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Collaborative Filtering on MovieLens dataset.\n",
    "\n",
    "Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalizes the matrix factorization problem with multi-layer perceptron. \n",
    "\n",
    "This notebook provides an example of how to utilize and evaluate NCF implementation in the `reco_utils`. We use a smaller dataset in this example to run NCF efficiently with GPU acceleration on a [Data Science Virtual Machine](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
      "[GCC 7.3.0]\n",
      "Pandas version: 0.24.2\n",
      "Tensorflow version: 1.12.0\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../\")\n",
    "import pandas as pd\n",
    "import tensorflow as tf\n",
    "tf.get_logger().setLevel('ERROR') # only show error messages\n",
    "\n",
    "from reco_utils.common.timer import Timer\n",
    "from reco_utils.recommender.ncf.ncf_singlenode import NCF\n",
    "from reco_utils.recommender.ncf.dataset import Dataset as NCFDataset\n",
    "from reco_utils.dataset import movielens\n",
    "from reco_utils.common.notebook_utils import is_jupyter\n",
    "from reco_utils.dataset.python_splitters import python_chrono_split\n",
    "from reco_utils.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k, \n",
    "                                                     recall_at_k, get_top_k_items)\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Pandas version: {}\".format(pd.__version__))\n",
    "print(\"Tensorflow version: {}\".format(tf.__version__))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the default parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# top k items to recommend\n",
    "TOP_K = 10\n",
    "\n",
    "# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
    "MOVIELENS_DATA_SIZE = '100k'\n",
    "\n",
    "# Model parameters\n",
    "EPOCHS = 50\n",
    "BATCH_SIZE = 256\n",
    "\n",
    "SEED = 42"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Download the MovieLens dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "4.93MB [00:00, 13.9MB/s]                            \n"
     ]
    }
   ],
   "source": [
    "df = movielens.load_pandas_df(\n",
    "    size=MOVIELENS_DATA_SIZE,\n",
    "    header=[\"userID\", \"itemID\", \"rating\", \"timestamp\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Split the data using the Spark chronological splitter provided in utilities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test = python_chrono_split(df, 0.75)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate an NCF dataset object from the data subsets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = NCFDataset(train=train, test=test, seed=SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Train the NCF model on the training data, and get the top-k recommendations for our testing data\n",
    "\n",
    "NCF accepts implicit feedback and generates prospensity of items to be recommended to users in the scale of 0 to 1. A recommended item list can then be generated based on the scores. Note that this quickstart notebook is using a smaller number of epochs to reduce time for training. As a consequence, the model performance will be slighlty deteriorated. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = NCF (\n",
    "    n_users=data.n_users, \n",
    "    n_items=data.n_items,\n",
    "    model_type=\"NeuMF\",\n",
    "    n_factors=4,\n",
    "    layer_sizes=[16,8,4],\n",
    "    n_epochs=EPOCHS,\n",
    "    batch_size=BATCH_SIZE,\n",
    "    learning_rate=1e-3,\n",
    "    verbose=10,\n",
    "    seed=SEED\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Took 331.9876825809479 seconds for training.\n"
     ]
    }
   ],
   "source": [
    "with Timer() as train_time:\n",
    "    model.fit(data)\n",
    "\n",
    "print(\"Took {} seconds for training.\".format(train_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the movie recommendation use case scenario, seen movies are not recommended to the users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Took 2.8517026901245117 seconds for prediction.\n"
     ]
    }
   ],
   "source": [
    "with Timer() as test_time:\n",
    "    users, items, preds = [], [], []\n",
    "    item = list(train.itemID.unique())\n",
    "    for user in train.userID.unique():\n",
    "        user = [user] * len(item) \n",
    "        users.extend(user)\n",
    "        items.extend(item)\n",
    "        preds.extend(list(model.predict(user, item, is_list=True)))\n",
    "\n",
    "    all_predictions = pd.DataFrame(data={\"userID\": users, \"itemID\":items, \"prediction\":preds})\n",
    "\n",
    "    merged = pd.merge(train, all_predictions, on=[\"userID\", \"itemID\"], how=\"outer\")\n",
    "    all_predictions = merged[merged.rating.isnull()].drop('rating', axis=1)\n",
    "\n",
    "print(\"Took {} seconds for prediction.\".format(test_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Evaluate how well NCF performs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ranking metrics are used for evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MAP:\t0.048986\n",
      "NDCG:\t0.198938\n",
      "Precision@K:\t0.178897\n",
      "Recall@K:\t0.098781\n"
     ]
    }
   ],
   "source": [
    "eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)\n",
    "eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)\n",
    "eval_precision = precision_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)\n",
    "eval_recall = recall_at_k(test, all_predictions, col_prediction='prediction', k=TOP_K)\n",
    "\n",
    "print(\"MAP:\\t%f\" % eval_map,\n",
    "      \"NDCG:\\t%f\" % eval_ndcg,\n",
    "      \"Precision@K:\\t%f\" % eval_precision,\n",
    "      \"Recall@K:\\t%f\" % eval_recall, sep='\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if is_jupyter():\n",
    "    # Record results with papermill for tests\n",
    "    import papermill as pm\n",
    "    import scrapbook as sb\n",
    "    sb.glue(\"map\", eval_map)\n",
    "    sb.glue(\"ndcg\", eval_ndcg)\n",
    "    sb.glue(\"precision\", eval_precision)\n",
    "    sb.glue(\"recall\", eval_recall)\n",
    "    sb.glue(\"train_time\", train_time.interval)\n",
    "    sb.glue(\"test_time\", test_time.interval)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (reco_gpu)",
   "language": "python",
   "name": "reco_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
